Many important loop preparation transformations involve reassociation of floating point values. See the discussion of floating point optimization above, especially the "-OPT:roundoff=n" option.
SWP must normally be careful during the initial and final iterations of a loop to not perform extra operations that may cause run-time traps. It must be similarly careful if early exits from a loop (that is, before the initially calculated trip count is reached) are possible. Turning off certain traps at run time can give it more flexibility, producing better schedules and/or simpler wind-up/wind-down code. See the target environment option -TENV:X=n for general control over the exception environment.
DO i=1,n
a(i) = a(i-1) + 5.0
END DO
Without back-substitution, each iteration must wait for the previous iteration's add to complete, yielding a best case of 4 cycles per iteration on the R8000. Back-substitution can transform the loop to code equivalent to:
DO i=1,n
a(i) = a(i-8) + 40.0
END DO
With appropriate initialization, this version can achieve an effective iteration interval of nearly 0.5 cycles.
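The back-substitution above can be sketched in C. This is only an illustration under stated assumptions, not the compiler's implementation: the function names recurrence_naive and recurrence_backsub are hypothetical, the array holds n+1 elements with a[0] as the seed value, and "appropriate initialization" is taken to mean computing the first eight elements with the original recurrence.

```c
#include <stddef.h>

/* Original recurrence: every add depends on the add from the
 * previous iteration, so the adds cannot overlap. */
void recurrence_naive(double *a, size_t n)
{
    for (size_t i = 1; i <= n; i++)
        a[i] = a[i - 1] + 5.0;
}

/* Back-substituted form (hypothetical sketch): once a[1..8] are set
 * up, each add depends only on the value eight iterations back, so
 * up to eight adds can be in flight at the same time. */
void recurrence_backsub(double *a, size_t n)
{
    size_t i;

    /* Initialization: the first eight elements use the original recurrence. */
    for (i = 1; i <= n && i <= 8; i++)
        a[i] = a[i - 1] + 5.0;

    /* Steady state: eight adds of 5.0 folded into one add of 40.0. */
    for (; i <= n; i++)
        a[i] = a[i - 8] + 40.0;
}
```

Adding 40.0 once is a reassociation of eight separate adds of 5.0, so the two versions can round differently for general inputs. They agree exactly only when every intermediate value is exactly representable, as with small integer-valued data.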
For example, compile tomcatv (a SPEC benchmark) with:
f77 -S -64 -O3 -mips4 -OPT:IEEE_arith=3:ro=3 -LIST:=ON tomcatv.f
A listing file, tomcatv.L, is produced. The results for the loop at line 32 look like this:
Compiling tomcatv.f (tomcatv.f)
Options:
-O3 (Optimization level)
-g0 (Debug level)
-m1 (Report warnings)
-TARG: (Target group)
  abi=64 (64-bit ABI)
  isa=mips4 (Instruction Set Architecture)
  processor=R8000
  madd=ON (Allow madd instructions)
-TENV: (Target environment group)
  PIC=ON (Shared code)
  small_GOT (Assume GOT < 64KB)
  no_page_offset=OFF (Use page/offset addressing)
  short_data=8 (Size limit for short data objects)
  short_literals=8 (Size limit for short literals)
  misalignment=0 (Misaligned data model)
  align_aggregrates=8 (Forced aggregate alignment)
  use_fp=OFF (Force frame pointer use)
  varargs_prototypes=ON (Require prototypes for varargs routines)
  X=1 (Exception suppression model)
-OPT: (Optimization group)
  div_split=OFF (Use a*(1/b) for a/b)
  fast_complex=OFF (Use fast complex norm/sqrt)
  fast_exp=OFF (Use fast exp algorithm)
  fast_sqrt=OFF (Use 1/rsqrt(x) for sqrt(x))
  fast_io=OFF (Use fast I/O intrinsics)
  fold_aggressive=OFF (Use aggressive expression folding)
  IEEE_arithmetic=3 (Level of IEEE-754 compliance)
  IEEE_comparisons=OFF (Don't eliminate comparisons like x==x)
  roundoff=3 (Level of roundoff errors allowed)
  space=OFF (Optimize code space over execution time)
  vector_intrinsics=OFF (Use vector intrinsics)
-LIST: (Listing group)
  =ON (Produce listing file)
  file=tomcatv.L (Listing file name)
  performance=ON (List performance information)
  source=OFF (List source code)
  symbols=OFF (List symbol table)
...
#<swps> Pipelined loop line 32 steady state
#<swps>
#<swps> Not unrolled before pipelining
#<swps> 4 cycles per iteration
#<swps> 1 flop ( 6% of peak) (madds count as 2)
#<swps> 1 flop ( 12% of peak) (madds count as 1)
#<swps> 0 madds ( 0% of peak)
#<swps> 8 mem refs (100% of peak)
#<swps> 3 integer ops ( 37% of peak)
#<swps> 12 instructions ( 75% of peak)
#<swps> 4 short trip threshold
#<swps>
#<swps> 4 possible stall cycles
#<swps> 4 min possible stall cycles
#<swps>
If you add the switch -SWP:body_ins=250, the amount of unrolling increases. The loop at line 32 becomes:
#<swps> Pipelined loop line 32 steady state
#<swps>
#<swps> 2 unrollings before pipelining
#<swps> 7 cycles per 2 iterations
#<swps> 2 flops ( 7% of peak) (madds count as 2)
#<swps> 2 flops ( 14% of peak) (madds count as 1)
#<swps> 0 madds ( 0% of peak)
#<swps> 14 mem refs (100% of peak)
#<swps> 3 integer ops ( 21% of peak)
#<swps> 19 instructions ( 67% of peak)
#<swps> 2 short trip threshold
#<swps>
#<swps> 6 possible stall cycles
#<swps> 4 min possible stall cycles
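The percentages in the two #<swps> reports above can be reproduced with simple arithmetic, which may help when reading such listings. The sketch below is an inference from the numbers shown, not documented compiler behavior: the helper names are hypothetical, and the per-cycle peaks are assumptions chosen because they match both listings (4 instructions, 2 memory references, 2 integer operations, and 4 flops per cycle when a madd counts as 2, or 2 flops per cycle when it counts as 1). The listing also appears to truncate percentages rather than round them.

```c
/* Percent of peak, truncated toward zero as the listings appear to do.
 * peak_per_cycle is an assumed machine limit inferred from the listings. */
int percent_of_peak(int count, int cycles, int peak_per_cycle)
{
    return (100 * count) / (cycles * peak_per_cycle);
}

/* Cycles per original loop iteration when the schedule covers
 * several unrolled iterations. */
double cycles_per_iteration(int cycles, int iterations)
{
    return (double)cycles / (double)iterations;
}
```

For the unrolled loop, cycles_per_iteration(7, 2) gives 3.5, compared with 4 for the loop that was not unrolled. In both reports the memory references are at 100% of peak (8 in 4 cycles, 14 in 7 cycles), which suggests the loop is bound by memory references rather than by floating-point work.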
DO i=1,n
IF ( a(i) .LT. b(i) ) THEN
c(i) = a(i)
ELSE
c(i) = b(i)
END IF
END DO
The loop body is compiled for MIPS IV as:
ldc1 $f0,a(i)
ldc1 $f1,b(i)
c.lt.s cc,$f0,$f1
movf.s $f0,$f1,cc
sdc1 $f0,c(i)
Note that no conditional branches occur in the code. This option is ON by default for MIPS IV targets only.
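The data flow of the compare and conditional move sequence above can be mirrored in C. This is a minimal sketch with hypothetical function names and an assumed element type of double. A C compiler remains free to emit branches or conditional moves for either form, so the second function only illustrates the shape of the if-converted code, not what any particular compiler generates.

```c
#include <stddef.h>

/* Branching form: one conditional branch per iteration. */
void select_min_branch(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (a[i] < b[i])
            c[i] = a[i];
        else
            c[i] = b[i];
    }
}

/* If-converted form (hypothetical sketch): both values are loaded,
 * the compare produces a condition, and a select replaces the branch. */
void select_min_cmov(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double r = a[i];        /* load a(i) into the result register  */
        int cc = a[i] < b[i];   /* compare sets the condition code     */
        r = cc ? r : b[i];      /* take b(i) when the condition is false */
        c[i] = r;               /* single unconditional store          */
    }
}
```

Both functions store the same values for every input. The second has a single straight-line body, which is the property that lets the loop be software pipelined.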
DO i=1,n
sum = sum + a(i)
END DO
Without interleaving, each iteration must wait for the previous iteration's add to complete, yielding a best case iteration interval of 4 cycles/iteration on the R8000. Interleaving can transform the loop to something equivalent to:
DO i=1,n,8
sum1 = sum1 + a(i)
sum2 = sum2 + a(i+1)
sum3 = sum3 + a(i+2)
sum4 = sum4 + a(i+3)
sum5 = sum5 + a(i+4)
sum6 = sum6 + a(i+5)
sum7 = sum7 + a(i+6)
sum8 = sum8 + a(i+7)
END DO
sum = sum1+sum2+sum3+sum4+sum5+sum6+sum7+sum8
This version can achieve an effective iteration interval of nearly 0.5 cycles. These transformations generally require -OPT:roundoff=2 or better.
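The interleaved reduction can be sketched in C as follows. The function names are hypothetical, and two details are assumptions added for the sketch: the array is 0-based, and a cleanup loop handles a trip count that is not a multiple of 8, a case the transformed loop above does not show.

```c
#include <stddef.h>

/* Serial reduction: each add waits for the previous add to finish. */
double sum_serial(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Interleaved reduction (hypothetical sketch): eight independent
 * partial sums, so eight adds can be in flight at once. */
double sum_interleaved(const double *a, size_t n)
{
    double s1 = 0.0, s2 = 0.0, s3 = 0.0, s4 = 0.0;
    double s5 = 0.0, s6 = 0.0, s7 = 0.0, s8 = 0.0;
    size_t i = 0;

    for (; i + 8 <= n; i += 8) {
        s1 += a[i];
        s2 += a[i + 1];
        s3 += a[i + 2];
        s4 += a[i + 3];
        s5 += a[i + 4];
        s6 += a[i + 5];
        s7 += a[i + 6];
        s8 += a[i + 7];
    }

    /* Combine the partial sums in the same order as the text. */
    double sum = s1 + s2 + s3 + s4 + s5 + s6 + s7 + s8;

    /* Cleanup for the remaining 0 to 7 elements (assumption of this sketch). */
    for (; i < n; i++)
        sum += a[i];
    return sum;
}
```

The interleaved version adds the elements in a different order than the serial loop, so for general floating-point data the two results can differ in the last bits. This is why the transformation is tied to the roundoff setting. For integer-valued inputs of moderate size the results are identical.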